Guadeloupe
Inside the App Where Queer Gooners Run Free
In light of Zoom crackdowns and Skype shutting down, Batemates has emerged as an alternative for "bators" who like masturbating together online. One night not long ago, Jaxon Roman sat naked in front of his laptop wearing only a pup hood as he masturbated with single-minded zeal to the attention of eight other men watching onscreen. It was a typical weekday for the 33-year-old Arlington, Virginia, program analyst. "When bros praise me and say they're enjoying [me], I get to that edge point so fast," Roman says. His favorite instances are "when they all come to what I'm doing." Sometimes, when he's feeling especially kinky, Roman, who is bisexual, likes to ask for permission before climaxing.
- North America > United States > Virginia > Arlington County > Arlington (0.24)
- Oceania > Guam (0.08)
- North America > United States > New York (0.04)
- Information Technology (0.69)
- Health & Medicine > Therapeutic Area (0.69)
- Leisure & Entertainment > Sports > Olympic Games (0.47)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.47)
AIA Forecaster: Technical Report
Alur, Rohan, Stadie, Bradly C., Kang, Daniel, Chen, Ryan, McManus, Matt, Rickert, Michael, Lee, Tyler, Federici, Michael, Zhu, Richard, Fogerty, Dennis, Williamson, Hayley, Lozinski, Nina, Linsky, Aaron, Sekhon, Jasjeet S.
This technical report describes the AIA Forecaster, a Large Language Model (LLM)-based system for judgmental forecasting using unstructured data. The AIA Forecaster approach combines three core elements: agentic search over high-quality news sources, a supervisor agent that reconciles disparate forecasts for the same event, and a set of statistical calibration techniques to counter behavioral biases in large language models. On the ForecastBench benchmark (Karger et al., 2024), the AIA Forecaster achieves performance equal to human superforecasters, surpassing prior LLM baselines. In addition to reporting on ForecastBench, we also introduce a more challenging forecasting benchmark sourced from liquid prediction markets. While the AIA Forecaster underperforms market consensus on this benchmark, an ensemble combining AIA Forecaster with market consensus outperforms consensus alone, demonstrating that our forecaster provides additive information. Our work establishes a new state of the art in AI forecasting and provides practical, transferable recommendations for future research. To the best of our knowledge, this is the first work that verifiably achieves expert-level forecasting at scale.
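The comparison between a forecaster, market consensus, and an ensemble of the two can be sketched as follows. This is a minimal illustration, not the report's method: it assumes binary questions, scores forecasts with the Brier score, and combines the two sources with a simple weighted average. The function names, the weight, and the toy probabilities are all hypothetical.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def linear_ensemble(
    model_probs: Sequence[float],
    market_probs: Sequence[float],
    weight_model: float = 0.5,
) -> List[float]:
    """Weighted average of two probability forecasts (assumed combination rule)."""
    if not 0.0 <= weight_model <= 1.0:
        raise ValueError("weight_model must lie in [0, 1]")
    if len(model_probs) != len(market_probs):
        raise ValueError("forecast lists must have equal length")
    return [
        weight_model * m + (1.0 - weight_model) * k
        for m, k in zip(model_probs, market_probs)
    ]


# Toy, made-up forecasts for three resolved binary questions.
model = [0.9, 0.2, 0.6]
market = [0.7, 0.1, 0.8]
outcomes = [1, 0, 1]

combined = linear_ensemble(model, market, weight_model=0.5)
for name, probs in [("model", model), ("market", market), ("ensemble", combined)]:
    print(f"{name:9s} Brier = {brier_score(probs, outcomes):.4f}")
```

Whether an ensemble beats either source depends on the data; in practice the weight would be chosen on held-out questions rather than fixed at 0.5.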
- North America > United States > New York > New York County > New York City (0.04)
- North America > Guadeloupe (0.04)
- Europe > France > Grand Est > Bas-Rhin > Strasbourg (0.04)
- Government (1.00)
- Banking & Finance > Trading (1.00)
- Leisure & Entertainment > Games > Chess (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Evaluating Large Language Models for IUCN Red List Species Information
Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
On the Theoretical Limitations of Embedding-Based Retrieval
Weller, Orion, Boratko, Michael, Naim, Iftekhar, Lee, Jinhyuk
Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
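The dependence of reachable top-k subsets on embedding dimension can be illustrated with a small sampling experiment. This is a sketch under stated assumptions, not the paper's free-embedding optimization: it assumes dot-product scoring, draws random Gaussian queries, and counts the distinct top-2 sets they produce, so for dimensions above one the count is only a lower bound on what is reachable. In one dimension the result is exact: a positive query returns the two largest documents and a negative query the two smallest, so only 2 of the C(n, 2) possible pairs can ever be retrieved. All function names here are hypothetical.

```python
import random
from math import comb
from typing import FrozenSet, List, Sequence, Set


def top_k_set(query: Sequence[float], docs: Sequence[Sequence[float]], k: int) -> FrozenSet[int]:
    """Indices of the k documents with the highest dot-product score for this query."""
    scores = [sum(q * x for q, x in zip(query, doc)) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])


def reachable_top_k_sets(
    docs: Sequence[Sequence[float]], k: int, n_queries: int = 5000, seed: int = 0
) -> Set[FrozenSet[int]]:
    """Distinct top-k sets found by sampling random Gaussian queries (a lower bound)."""
    rng = random.Random(seed)
    dim = len(docs[0])
    found: Set[FrozenSet[int]] = set()
    for _ in range(n_queries):
        query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        found.add(top_k_set(query, docs, k))
    return found


# Compare sampled coverage against all C(n, 2) pairs for a few dimensions.
n_docs, k = 8, 2
rng = random.Random(42)
for dim in (1, 2, 4):
    docs: List[List[float]] = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_docs)]
    found = reachable_top_k_sets(docs, k)
    print(f"dim={dim}: sampled {len(found)} of {comb(n_docs, k)} possible top-{k} sets")
```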
- North America > Guadeloupe (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
The Judge Variable: Challenging Judge-Agnostic Legal Judgment Prediction
This study examines the role of human judges in legal decision-making by using machine learning to predict child physical custody outcomes in French appellate courts. Building on the legal realism-formalism debate, we test whether individual judges' decision-making patterns significantly influence case outcomes, challenging the assumption that judges are neutral variables that apply the law uniformly. To ensure compliance with French privacy laws, we implement a strict pseudonymization process. Our analysis uses 18,937 living-arrangement rulings extracted from 10,306 cases. We compare models trained on individual judges' past rulings (specialist models) with a judge-agnostic model trained on aggregated data (generalist model). The prediction pipeline is a hybrid approach combining large language models (LLMs) for structured feature extraction and ML models for outcome prediction (RF, XGB and SVC). Our results show that specialist models consistently achieve higher predictive accuracy than the generalist model, with top-performing specialist models reaching F1 scores as high as 92.85%, compared to 82.63% for the generalist model, even though the latter is trained on 20x to 100x more samples. Specialist models capture stable individual patterns that are not transferable to other judges. In-Domain and Cross-Domain validity tests provide empirical support for legal realism, demonstrating that judicial identity plays a measurable role in legal outcomes. All data and code used will be made available.
- South America > French Guiana > Guyane > Cayenne (0.04)
- Oceania > New Caledonia > South Province > Noumea (0.04)
- Oceania > French Polynesia > Windward Islands > Papeete (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Law > Litigation (1.00)
- Law > Government & the Courts (0.94)
- Law > Criminal Law (0.93)
A Survey of Explainable Reinforcement Learning: Targets, Methods and Needs
The success of recent Artificial Intelligence (AI) models has been accompanied by the opacity of their internal mechanisms, due notably to the use of deep neural networks. To understand these internal mechanisms and explain the output of these AI models, a set of methods has been proposed, grouped under the domain of eXplainable AI (XAI). This paper focuses on a sub-domain of XAI, called eXplainable Reinforcement Learning (XRL), which aims to explain the actions of an agent that has learned by reinforcement learning. We propose an intuitive taxonomy based on two questions: "What" and "How". The first question focuses on the target that the method explains, while the second relates to the way the explanation is provided. We use this taxonomy to provide a state-of-the-art review of over 250 papers. In addition, we present a set of domains close to XRL, which we believe should get attention from the community. Finally, we identify some needs for the field of XRL.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.28)
- North America > United States > New York > New York County > New York City (0.14)
- Europe > Austria > Vienna (0.14)
- Overview (1.00)
- Research Report > New Finding (0.67)
- Health & Medicine (1.00)
- Energy (1.00)
- Education (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)
WikiVideo: Article Generation from Multiple Videos
Martin, Alexander, Kriz, Reno, Walden, William Gantt, Sanders, Kate, Recknor, Hannah, Yang, Eugene, Ferraro, Francis, Van Durme, Benjamin
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text, and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
- Europe > France > Île-de-France > Paris > Paris (0.29)
- North America > The Bahamas (0.14)
- North America > United States > Georgia (0.14)
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
Joshi, Siddharth, Nushi, Besmira, Balachandran, Vidhisha, Chandrasekaran, Varun, Vineet, Vibhav, Joshi, Neel, Mirzasoleiman, Baharan
Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
- Europe > United Kingdom > England (0.28)
- North America > Dominican Republic (0.04)
- Europe > United Kingdom > Scotland (0.04)
MIRAI: Evaluating LLM Agents for Event Forecasting
Ye, Chenchen, Hu, Ziniu, Deng, Yihe, Huang, Zijie, Ma, Mingyu Derek, Zhu, Yanqiao, Wang, Wei
Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information and reason over it to solve complex problems. Given this capability, there is increasing interest in employing LLM agents to predict international events, which can influence decision-making and shape policy development on an international scale. Despite this growing interest, there is no rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities along three dimensions: the ability to 1) autonomously source and integrate critical information from large global databases; 2) write code using domain-specific APIs and libraries for tool use; and 3) jointly reason over historical knowledge from diverse formats and time periods to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.
- Asia > North Korea (0.14)
- Oceania > Australia > Australian Indian Ocean Territories > Territory of Cocos (Keeling) Islands (0.14)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Law (1.00)
- Government > Foreign Policy (1.00)
- Government > Military (0.93)
- Information Technology (0.92)